Who is this for?

Most helpful if you:

  • Use a variety of modeling methods: linear models, generalized linear models, LASSO, trees, random forest, etc.
  • Are familiar with cross-validation.
  • Use the caret package.
  • Are comfortable with the tidyverse.

Also useful if you:

  • Have used some modeling techniques, like lm().
  • Are excited about learning machine learning.

Follow along at https://github.com/llendway/2020_north_tidymodels, using the tidymodels_demo.Rmd file.

What will we cover?

Bird’s eye view of my garden (Image Credit: Google Maps)

Grape tomatoes from my garden

Machine Learning Process

Libraries

The libraries we will use:

library(tidyverse)         # for reading in data, graphing, and cleaning
library(lubridate)         # for date manipulation
library(tidymodels)        # for modeling
library(moderndive)        # for King County housing data
library(vip)               # for variable importance plots
theme_set(theme_minimal()) # my favorite ggplot2 theme :)

Similar to the tidyverse, tidymodels is a collection of packages:

tidymodels_packages()
##  [1] "broom"         "cli"           "crayon"        "dials"        
##  [5] "dplyr"         "ggplot2"       "infer"         "magrittr"     
##  [9] "parsnip"       "pillar"        "purrr"         "recipes"      
## [13] "rlang"         "rsample"       "rstudioapi"    "tibble"       
## [17] "tidytext"      "tidypredict"   "tidyposterior" "tune"         
## [21] "workflows"     "yardstick"     "tidymodels"

The data

According to the house_prices documentation, “This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.”

data("house_prices")

house_prices %>% 
  slice(1:10)

We will model home price using the other variables in the dataset.

Exploration

Overview of modeling process

Data splitting

set.seed(327) #for reproducibility

# Randomly assigns 75% of the data to training.
house_split <- initial_split(house_prices, 
                             prop = .75)
house_split
## <16210/5403/21613>
# <training/testing/total>

house_training <- training(house_split)
house_testing <- testing(house_split)

Data splitting

Later, we will use 5-fold cross-validation to evaluate the model and tune model parameters.

set.seed(1211) # for reproducibility
house_cv <- vfold_cv(house_training, v = 5)
house_cv 

Data preprocessing and recipe

A variety of step_xxx() functions can be used to do data pre-processing/transforming. Find them all here.
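As a quick illustration of how the step_xxx() functions compose (a toy example I made up, separate from the house-price analysis), each step is added to the recipe with a pipe and nothing is computed until the recipe is prepped:

```r
library(recipes)
library(dplyr)

# Toy data: one numeric and one nominal predictor
toy <- tibble(
  y    = c(10, 20, 30, 40),
  sqft = c(100, 1000, 10000, 100000),
  cond = factor(c("good", "bad", "good", "bad"))
)

toy_rec <- recipe(y ~ ., data = toy) %>% 
  step_log(sqft, base = 10) %>%  # log-transform a numeric predictor
  step_dummy(all_nominal())      # dummy-code the factor

# prep() estimates whatever the steps need from the training data;
# juice() returns the processed training data
toy_rec %>% 
  prep(toy) %>% 
  juice()
```

Here sqft becomes 2, 3, 4, 5 and cond becomes a single 0/1 column cond_good.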

Data preprocessing and recipe

Beginning of the code:

house_recipe <- recipe(price ~ ., #short-cut, . = all other vars
                       data = house_training) %>% 
  # Pre-processing:
  # Remove variables redundant with sqft_living and sqft_lot
  step_rm(sqft_living15, sqft_lot15) %>%
  # log sqft variables & price
  step_log(starts_with("sqft"),-sqft_basement, price, 
           base = 10) %>% 

Data preprocessing and recipe

Continuation of the code:

  # new grade variable combines low grades & high grades
  # indicator variables for no basement, not renovated, no view
  # waterfront to numeric
  # age of house
  step_mutate(grade = as.character(grade),
              grade = fct_relevel(
                        case_when(
                          grade %in% as.character(1:6)   ~ "below_average",
                          grade %in% as.character(10:13) ~ "high",
                          TRUE ~ grade
                        ),
                        "below_average", "7", "8", "9", "high"),
              basement = as.numeric(sqft_basement == 0),
              renovated = as.numeric(yr_renovated == 0),
              view = as.numeric(view == 0),
              waterfront = as.numeric(waterfront),
              age_at_sale = year(date) - yr_built) %>% 

Data preprocessing and recipe

Continuation of the code:

  # Remove sqft_basement, yr_renovated, and yr_built
  step_rm(sqft_basement, yr_renovated, yr_built) %>% 
  # Create a month variable
  step_date(date, features = "month") %>% 
  # Make these evaluative variables, not included in modeling
  update_role(all_of(c("id","date","zipcode", 
                       "lat", "long")),
              new_role = "evaluative") %>% 
  # Create indicator variables for factors/character/nominal
  step_dummy(all_nominal(), 
             -has_role(match = "evaluative"))

Apply recipe and steps

Apply the recipe and steps to the training dataset in order to see the output. Notice the names of the new variables.

house_recipe %>% 
  prep(house_training) %>%
  juice() 

Defining the model

To define our model, we follow these steps:

  • Define the model type, which is the general type of model you want to fit.
  • Set the engine, which defines the package/function that will be used to fit the model.
  • Set the mode, which is either “regression” for continuous response variables or “classification” for binary/categorical response variables. (Note that for linear regression, it can only be “regression”, so we don’t NEED this step in this case.)
  • (OPTIONAL) Set arguments to tune. We’ll see an example of this later.

Find all available functions from parsnip here. Here is the detail for linear regression.

house_linear_mod <- 
  # Define a linear regression model
  linear_reg() %>% 
  # Set the engine to "lm" (lm() function is used to fit model)
  set_engine("lm") %>% 
  # Not necessary here, but good to remember for other models
  set_mode("regression")

Creating a workflow

This combines the preprocessing and model definition steps.

house_lm_wf <- 
  # Set up the workflow
  workflow() %>% 
  # Add the recipe
  add_recipe(house_recipe) %>% 
  # Add the modeling
  add_model(house_linear_mod)

Workflow output

house_lm_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 6 Recipe Steps
## 
## ● step_rm()
## ● step_log()
## ● step_mutate()
## ● step_rm()
## ● step_date()
## ● step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

Modeling

Use the fit() function to fit the model to training data. Then display the results nicely.

house_lm_fit <- 
  # Tell it the workflow
  house_lm_wf %>% 
  # Fit the model to the training data
  fit(house_training)

# Display the results nicely
house_lm_fit %>% 
  pull_workflow_fit() %>% 
  tidy() %>% 
  mutate_if(is.numeric, ~round(.x, 3))

Evaluating model (overview)

To evaluate the model, we will use cross-validation (CV), specifically 5-fold CV.
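For regression models, fit_resamples() reports rmse and rsq by default. As a reminder, on the log10(price) scale used by our recipe, the root mean squared error is

```latex
\mathrm{RMSE} \;=\; \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}}
```

computed on each fold's held-out assessment set and then averaged across the 5 folds.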

Evaluating model (code)

house_lm_fit_cv <-
  # Tell it the workflow
  house_lm_wf %>% 
  # Fit the model (using the workflow) to the cv data
  fit_resamples(house_cv)

# rmse for each fold:
house_lm_fit_cv %>% 
  select(id, .metrics) %>% 
  unnest(.metrics) %>% 
  filter(.metric == "rmse")

Evaluating model (code)

# Evaluation metrics averaged over all folds:
collect_metrics(house_lm_fit_cv)
# To show you where the averages come from:
house_lm_fit_cv %>% 
  select(id, .metrics) %>% 
  unnest(.metrics) %>% 
  group_by(.metric, .estimator) %>% 
  summarize(mean = mean(.estimate),
            n = n(),
            std_err = sd(.estimate)/sqrt(n))

Predicting and evaluating testing data

house_lm_test <- 
  # The modeling work flow
  house_lm_wf %>% 
  # Use training data to fit the model and apply it to testing data
  last_fit(house_split)
# performance metrics from testing data
collect_metrics(house_lm_test)
# sample of predictions from testing data
collect_predictions(house_lm_test) %>% 
  slice(1:5)

How will the model be used?

LASSO model - set up model

For more information about the lasso, see https://en.wikipedia.org/wiki/Lasso_(statistics)

The tune() function inside of set_args() tells it that we will tune the penalty parameter later.
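For reference (not shown on the original slide), with λ playing the role of the penalty argument, the lasso estimate solves

```latex
\hat{\beta}^{\text{lasso}}
  \;=\; \arg\min_{\beta}\; \sum_{i=1}^{n}\big(y_i - x_i^{\top}\beta\big)^{2}
  \;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

Larger values of λ shrink more coefficients exactly to zero, which is why the lasso also performs variable selection.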

house_lasso_mod <- 
  # Define a lasso model 
  # mixture = 1 is the lasso; it is also glmnet's default (alpha = 1)
  linear_reg(mixture = 1) %>% 
  # Set the engine to "glmnet" 
  set_engine("glmnet") %>% 
  # The parameters we will tune.
  set_args(penalty = tune()) %>% 
  # Use "regression"
  set_mode("regression")

Create the workflow

We use the same recipe as before but add the new model.

house_lasso_wf <- 
  # Set up the workflow
  workflow() %>% 
  # Add the recipe
  add_recipe(house_recipe) %>% 
  # Add the modeling
  add_model(house_lasso_mod)

Workflow output

house_lasso_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 6 Recipe Steps
## 
## ● step_rm()
## ● step_log()
## ● step_mutate()
## ● step_rm()
## ● step_date()
## ● step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Main Arguments:
##   penalty = tune()
##   mixture = 1
## 
## Computational engine: glmnet

Tuning the penalty parameter

We use the grid_regular() function from the dials library to choose some values of the penalty parameter for us. Alternatively, we could give it a vector of values we want to try.

penalty_grid <- grid_regular(penalty(),
                             levels = 20)
penalty_grid
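To illustrate the alternative mentioned above (a sketch; these particular values are arbitrary), tune_grid() also accepts any data frame with a penalty column:

```r
library(tibble)

# Instead of grid_regular(), supply the candidate penalty values directly:
# 10 values evenly spaced on the log10 scale from 1e-4 to 1
penalty_grid_manual <- tibble(
  penalty = 10^seq(-4, 0, length.out = 10)
)
penalty_grid_manual
```

This would be passed to tune_grid() via grid = penalty_grid_manual in place of penalty_grid.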

Tuning the penalty parameter

Use the tune_grid() function to fit the model using cross-validation for all penalty_grid values and evaluate on all the folds.

house_lasso_tune <- 
  house_lasso_wf %>% 
  tune_grid(
    resamples = house_cv,
    grid = penalty_grid
    )

house_lasso_tune

Tuning the penalty parameter

Then look at the cross-validated results. This shows rmse for fold 1 for each penalty value from penalty_grid.

# The rmse for each penalty value for fold 1:
house_lasso_tune %>% 
  select(id, .metrics) %>% 
  unnest(.metrics) %>% 
  filter(.metric == "rmse", id == "Fold1")

Tuning the penalty parameter

# rmse averaged over all folds, for 5 of the penalty values:
house_lasso_tune %>% 
  collect_metrics() %>% 
  filter(.metric == "rmse") %>% 
  slice(10:14)
# Best tuning parameter by smallest rmse
best_param <- house_lasso_tune %>% 
  select_best(metric = "rmse")
best_param

Tuning the penalty parameter

# Visualize rmse vs. penalty
house_lasso_tune %>% 
  collect_metrics() %>% 
  filter(.metric == "rmse") %>% 
  ggplot(aes(x = penalty, y = mean)) +
  geom_point() +
  geom_line() +
  scale_x_log10() +
  labs(x = "penalty", y = "rmse")

Finalize workflow for best tuned parameter

house_lasso_final_wf <- house_lasso_wf %>% 
  finalize_workflow(best_param)
house_lasso_final_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 6 Recipe Steps
## 
## ● step_rm()
## ● step_log()
## ● step_mutate()
## ● step_rm()
## ● step_date()
## ● step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Main Arguments:
##   penalty = 0.000206913808111479
##   mixture = 1
## 
## Computational engine: glmnet

Evaluate on testing data

Like before, we apply the model to the test data and examine some final metrics. We also show the metrics from the regular linear model.

# Fit model with best tuning parameter(s) to training data and apply to test data
house_lasso_test <- house_lasso_final_wf %>% 
  last_fit(house_split)

# Metrics for model applied to test data
house_lasso_test %>% 
  collect_metrics()
# Compare to regular linear regression results
collect_metrics(house_lm_test)

Thank you & Resources

THANK YOU!

You have bean a wonderful audience!

Close-up of my green beans

THANK YOU!

May your tidymodels endeavors be fruit … er … vegetable-ful?

My garden as of 07/08/2020

Questions?